We present a new method to translate videos to commands for robotic manipulation using Deep Recurrent Neural Networks (RNN). Our framework first extracts deep features from the input video frames with a deep Convolutional Neural Network (CNN). Two RNN layers with an encoder-decoder architecture are then used to encode the visual features and sequentially generate the output words as the command. We demonstrate that the translation accuracy can be improved by allowing a smooth transition between the two RNN layers and using a state-of-the-art feature extractor. The experimental results on our new challenging dataset show that our approach outperforms recent methods by a fair margin. Furthermore, we combine the proposed translation module with the vision and planning system to let a robot perform various manipulation tasks. Finally, we demonstrate the effectiveness of our framework on the full-size humanoid robot WALK-MAN.
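The abstract describes a CNN feature extractor followed by a two-layer encoder-decoder RNN that emits the command word by word, but it does not specify a framework or layer sizes. The following is a minimal PyTorch sketch of that pipeline under assumed dimensions (feat_dim, hidden_dim, vocab_size are illustrative), where passing the encoder's final state to the decoder stands in for the "smooth transition between the two RNN layers"; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class VideoToCommand(nn.Module):
    """Sketch: encode per-frame CNN features with one RNN layer,
    then decode a word sequence with a second RNN layer."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, word_ids):
        # frame_feats: (batch, n_frames, feat_dim) features from a pretrained CNN
        # word_ids:    (batch, n_words) ground-truth command tokens (teacher forcing)
        _, enc_state = self.encoder(frame_feats)
        # Hand the encoder's final (hidden, cell) state to the decoder so the
        # visual summary flows directly into word generation.
        dec_out, _ = self.decoder(self.embed(word_ids), enc_state)
        return self.out(dec_out)  # (batch, n_words, vocab_size) logits

# Toy usage with random tensors standing in for CNN features and tokens.
model = VideoToCommand()
feats = torch.randn(2, 30, 2048)       # 2 clips, 30 frames each
words = torch.randint(0, 128, (2, 6))  # 6-word commands
print(model(feats, words).shape)       # torch.Size([2, 6, 128])
```

At inference time the decoder would instead be run autoregressively, feeding each predicted word back in until an end-of-sentence token is produced.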